# %load_ext pretty_jupyter

import sqlite3
import pandas as pd
import cpi
from itertools import product
from IPython.display import display_html
import statsmodels.formula.api as smf

def custom_formatter(value):
    ''' Helper function for formatting numbers in tables '''
    return '{:,.1%}'.format(value) if type(value) != str else value

# cpi.update()
cpi_data = {year: cpi.inflate(1, year) for year in range(1970, 2023)}

def inflate(money):
    ''' Helper function because cpi function is slow '''
    return money*pd.Series(data=money.index, index=money.index).map(cpi_data)

# Database connection for subsetting data below
conn = sqlite3.connect('ccd_db/state/data/state.db')
cursor = conn.cursor()

# Helper mapping for making table displays prettier/more readable.
state_names = pd.read_html('https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code')[0][['Name', 'Alpha code']]
state_name_mapping = state_names.set_index('Alpha code')['Name'].to_dict()

0. Skills/Tools

  • Data cleaning/python scripting. Packages: pandas, requests, sqlite3
  • SQL database creation and querying. Engine: sqlite3
  • Data visualization. Software: Tableau
  • Statistical modeling

1. Introduction

I've written two reports (here and here) analyzing academic achievement and education spending data for the state of Utah. I found that there was essentially no evidence of changes in spending affecting academic achievement and became curious if this generalized to the entire US. TL;DR: There is a very, very weak positive relationship between per-student, inflation-adjusted spending and education efficacy. Two academic subjects were examined (math and reading) and only math exhibited this relationship.

Data Used:

There were two primary datasets that were utilized.

NAEP administers a variety of tests at different grade levels, subjects, and other parameters but to simplify analysis, only 8th grade results in math and reading were used. In some of the visualizations of the NAEP data below, there is a distinction between NAEP scores "with accommodations" and scores "without accommodations". In 1996, NAEP began to introduce accommodations for students with disabilities and English learners and these results were reported alongside those without accommodations. In 2002, testing without accommodations was stopped and only results with accommodations were reported thereafter. For more information, see here

All this data was downloaded, cleaned, organized, and compiled into a SQL database where is queried in this report for analysis.

2. Academic Trends

Each state has it's own public education journey but there are some general trends that can be easily observed. NAEP math scores in almost all states were going up until approximately 2013. However, after 2013 math scores began to decline. NAEP reading scores were largely flat until 2013 after which scores began to decline. The declines have been so great that the median NAEP math score (273) is now were it was in 2000, erasing years of gains. And the median NAEP reading score (257) is the lowest since the reading tests were first administered in 1998. Visualized here:

# Query obtaining math and reading scores.
cursor.execute('''
    SELECT
        end_year,
        jurisdiction,
        math_read as subject,
        accommodations,
        mean as avg_naep
    FROM
        naep
    WHERE
        grade = 8
        AND jurisdiction NOT LIKE 'X%'
        AND jurisdiction NOT IN ('NT', 'NP', 'NR', 'AS', 'GU', 'VI', 'YA',
                                 'PR', 'DS', 'NL', 'DC')
    ;
''')

naep = pd.DataFrame(cursor.fetchall(),
                    columns=[column[0] for column in cursor.description])

naep.to_csv('naep.csv', index=False)
%%html
<div class='tableauPlaceholder' id='viz1743784225298' style='position: relative'><noscript><a href='#'><img alt='NAEP Scores over Time by State8th Grade. R2 median in green. R3 median in red. ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;NA&#47;NAEPScoresoverTimebyState&#47;NAEPScoresoverTimebyState&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='NAEPScoresoverTimebyState&#47;NAEPScoresoverTimebyState' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;NA&#47;NAEPScoresoverTimebyState&#47;NAEPScoresoverTimebyState&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1743784225298');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>

2003 was the first year that NAEP tests with accommodations were administered to all states in reading and math. If we look at the percentage change in NAEP scores by state, we begin to see that in reading and math only 3 states for each subject saw improvements of over 1% in either test. So top performers over this period are more cases of "not failing" rather than improving.

Top/Bottom 5 % Growth in NAEP Math Since 2003 (with accommodations)

# Query obtaining top/bottom states by % change in NAEP math scores.
cursor.execute('''
    SELECT
        end_year,
        jurisdiction,
        mean as avg_naep
    FROM
        naep
    WHERE
        math_read = 'MAT'
        AND accommodations = 'R3'
        AND grade = 8
        AND jurisdiction NOT LIKE 'X%'
        AND jurisdiction NOT IN ('NT', 'NP', 'NR', 'AS', 'GU', 'VI', 'YA',
                                 'PR', 'DS', 'NL', 'DC')
    ;
''')

naep_math = pd.DataFrame(cursor.fetchall(),
                         columns=[column[0] for column in cursor.description])

# Calculate change over time span that we have R3, grade 8 scores (since 2003)
naep_math_pivot = naep_math.pivot(columns=['end_year'],
                                  index=['jurisdiction'],
                                  values='avg_naep')
naep_math_change = (
    naep_math_pivot[2024]
    .div(naep_math_pivot[2003])
    .subtract(1)
    .sort_values(ascending=False)
    .rename('Percent Change')
)
naep_math_top = pd.DataFrame(naep_math_change.head()).reset_index()
naep_math_bottom = pd.DataFrame(naep_math_change.tail()).reset_index()

# Formatting for table display
top_math_styler = (
    naep_math_top
    .replace({'jurisdiction': state_name_mapping})
    .rename(columns={'jurisdiction': 'State'})
    .style
    .set_table_attributes("style='display:inline; width: 100%;'")
    .set_caption('Top 5 - % Growth in NAEP Math')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)
bottom_math_styler = (
    naep_math_bottom
    .replace({'jurisdiction': state_name_mapping})
    .rename(columns={'jurisdiction': 'State'})
    .style
    .set_table_attributes("style='display:inline;'")
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .set_caption('Bottom 5 - % Growth in NAEP Math')
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)


display_html(top_math_styler._repr_html_()+10*'&emsp;'+bottom_math_styler._repr_html_()+'<br>', raw=True)
Top 5 - % Growth in NAEP Math
State Percent Change
Mississippi 3.1%
Tennessee 2.9%
Hawaii 1.6%
California 0.6%
Utah 0.4%
          
Bottom 5 - % Growth in NAEP Math
State Percent Change
Kansas -3.5%
West Virginia -3.7%
Oregon -4.6%
Delaware -5.1%
Alaska -5.4%

Top/Bottom 5 % Growth in NAEP Reading Since 2003 (with accommodations)

# Query obtaining top/bottom states by % change in NAEP reading scores.
cursor.execute('''
    SELECT
        end_year,
        jurisdiction,
        mean as avg_naep
    FROM
        naep
    WHERE
        math_read = 'RED'
        AND accommodations = 'R3'
        AND grade = 8
        AND jurisdiction NOT LIKE 'X%'
        AND jurisdiction NOT IN ('NT', 'NP', 'NR', 'AS', 'GU', 'VI', 'YA',
                                 'PR', 'DS', 'NL', 'DC')
    ;
''')

naep_read = pd.DataFrame(cursor.fetchall(),
                    columns=[column[0] for column in cursor.description])

# Calculate change over time span that we have R3, grade 8 scores (since 2003)
naep_read_pivot = naep_read.pivot(columns=['end_year'],
                                  index=['jurisdiction'],
                                  values='avg_naep')
naep_read_change = (
    naep_read_pivot[2024]
    .div(naep_read_pivot[2003])
    .subtract(1)
    .sort_values(ascending=False)
    .rename('Percent Change')
)
naep_read_top = pd.DataFrame(naep_read_change.head()).reset_index()
naep_read_bottom = pd.DataFrame(naep_read_change.tail()).reset_index()

# Formatting for table display
top_reading_styler = (
    naep_read_top
    .replace({'jurisdiction': state_name_mapping})
    .rename(columns={'jurisdiction': 'State'})
    .style
    .set_table_attributes("style='display:inline'")
    .set_caption('Top 5 - % Growth in NAEP Reading')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)
bottom_reading_styler = (
    naep_read_bottom
    .replace({'jurisdiction': state_name_mapping})
    .rename(columns={'jurisdiction': 'State'})
    .style
    .set_table_attributes("style='display:inline'")
    .set_caption('Bottom 5 - % Growth in NAEP Reading')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)

display_html(top_reading_styler._repr_html_()+10*'&emsp;'+bottom_reading_styler._repr_html_(), raw=True)
Top 5 - % Growth in NAEP Reading
State Percent Change
Hawaii 2.4%
California 1.3%
Louisiana 1.2%
Georgia 0.6%
Nevada 0.4%
          
Bottom 5 - % Growth in NAEP Reading
State Percent Change
West Virginia -4.7%
Vermont -4.9%
Oklahoma -4.9%
Maine -4.9%
Delaware -5.7%

3. Spending Trends

Similar to the academic trends observed above, spending trends are roughly the same across states. We can see in the graph below that almost all states saw an increase in the amount of per student spending from 1987 to 2009, a decline from 2009 to 2013, and a subsequent continuation of the increase in spending.

# Query obtaining spending per student per state.
cursor.execute('''
    SELECT
        fiscal.end_year,
        fiscal.te11 / membership.student_count as exp_per_stu,
        fiscal.stabr
    FROM
        fiscal
    INNER JOIN
        membership
        ON fiscal.end_year = membership.end_year
        AND fiscal.fipst = membership.fipst
    WHERE 
        membership.total_indicator = 'Derived - Education Unit Total minus Adult Education Count'
        AND fiscal.stabr NOT IN ('GU', 'VI','AS', 'PR', 'DC', 'MP')
    ;
''')

fiscal = pd.DataFrame(cursor.fetchall(),
                      columns=[column[0] for column in cursor.description])

# Adjust spending for inflation
fiscal['exp_ia'] = fiscal['exp_per_stu'] * fiscal['end_year'].map(cpi_data)
fiscal = fiscal.drop(columns=['exp_per_stu'])

# Print fiscal table to file for distribution of spending over time line chart viz.
# fiscal.to_csv('fiscal.csv', index=False)
%%html
<div class='tableauPlaceholder' id='viz1743461013621' style='position: relative'><noscript><a href='#'><img alt='Spending per Student over TimeMedian in red ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Sp&#47;SpendingperStudentoverTimebyState&#47;SpendingperStudentoverTime&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='SpendingperStudentoverTimebyState&#47;SpendingperStudentoverTime' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Sp&#47;SpendingperStudentoverTimebyState&#47;SpendingperStudentoverTime&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1743461013621');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>

I would like to note that these spending figures are inflation adjusted to 2025 dollars and that these are per-student spending amounts. In light of this, I think it's startling just how great the increases in spending are.

Top/Bottom 5 in % Change in Spending Since 1987

# Table showing top and bottom 5 % change since 1987

fiscal_pivot = fiscal.pivot(columns=['end_year'], index=['stabr'], values='exp_ia')
fiscal_change = pd.DataFrame({
    'Since 1987': (fiscal_pivot[2022].div(fiscal_pivot[1987]) - 1),
    'Since 2003': (fiscal_pivot[2022].div(fiscal_pivot[2003]) - 1),
})

fiscal_1987_change = (
    fiscal_change
    .sort_values(by='Since 1987', ascending=False)
    .reset_index()
    .loc[:, ['stabr', 'Since 1987']]
)

fiscal_1987_top = fiscal_1987_change.head()
fiscal_1987_bottom = fiscal_1987_change.tail()

# Formatting for table display
fiscal_1987_top_styler = (
    fiscal_1987_top
    .replace({'stabr': state_name_mapping})
    .rename(columns={'stabr': 'State', 'Since 1987': 'Percent Change'})
    .style
    .set_table_attributes("style='display:inline'")
    .set_caption('Top 5 - % Change in Per Student Spending')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)
fiscal_1987_bottom_styler = (
    fiscal_1987_bottom
    .replace({'stabr': state_name_mapping})
    .rename(columns={'stabr': 'State', 'Since 1987': 'Percent Change'})
    .style
    .set_table_attributes("style='display:inline'")
    .set_caption('Bottom 5 - % Change in Per Student Spending')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)

display_html(fiscal_1987_top_styler._repr_html_()+'&emsp;'+'&emsp;'+'&emsp;'+fiscal_1987_bottom_styler._repr_html_(), raw=True)
Top 5 - % Change in Per Student Spending
State Percent Change
Vermont 146.9%
Kentucky 136.3%
Illinois 133.0%
New Hampshire 125.6%
North Dakota 122.8%
   
Bottom 5 - % Change in Per Student Spending
State Percent Change
Montana 56.3%
Nevada 54.7%
Florida 44.7%
Arizona 37.3%
Alaska 19.3%

Top/Bottom 5 in % Change in Spending Since 2003

Since the NAEP scores for NAEP tests with accommodations only exist from 2003 and after, here are the same tables but calculated since 2003:

# Table showing top and bottom 5 % change since 2003
fiscal_2003_change = (
    fiscal_change
    .sort_values(by='Since 2003', ascending=False)
    .reset_index()
    .loc[:, ['stabr', 'Since 2003']]
)

fiscal_2003_top = fiscal_2003_change.head()
fiscal_2003_bottom = fiscal_2003_change.tail()

# Formatting for table display
fiscal_2003_top_styler = (
    fiscal_2003_top
    .replace({'stabr': state_name_mapping})
    .rename(columns={'stabr': 'State', 'Since 2003': 'Percent Change'})
    .style
    .set_table_attributes("style='display:inline'")
    .set_caption('Top 5 - % Change in Per Student Spending')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)
fiscal_2003_bottom_styler = (
    fiscal_2003_bottom
    .replace({'stabr': state_name_mapping})
    .rename(columns={'stabr': 'State', 'Since 2003': 'Percent Change'})
    .style
    .set_table_attributes("style='display:inline'")
    .set_caption('Bottom 5 - % Change in Per Student Spending')
    .set_table_styles([
        {'selector': 'caption',
         'props': [('white-space', 'nowrap'), ('text-align', 'left'), ('font-size', '16px')]}
    ])
    .hide()
    .format({'Percent Change': '{:,.1%}'.format}, precision=2)
)

display_html(fiscal_2003_top_styler._repr_html_()+'&emsp;'+'&emsp;'+'&emsp;'+fiscal_2003_bottom_styler._repr_html_()+'<br>', raw=True)
Top 5 - % Change in Per Student Spending
State Percent Change
Vermont 49.8%
North Dakota 48.1%
Washington 45.3%
New York 41.8%
Hawaii 39.7%
   
Bottom 5 - % Change in Per Student Spending
State Percent Change
Georgia 4.4%
Nevada 3.5%
Michigan 2.0%
Idaho -1.6%
Indiana -5.0%

4. Visual Comparison of Trends

Inspecting the top/bottom 5 tables above, besides Hawaii, no top spending growth state is a top academic improver and no bottom spending growth state is a bottom academic improver. This begins to suggest that perhaps there may not be a relationship between spending and academic success.

In the graphic below, we get some more clues that the relationship is tenuous. The big takeaway from this visualization is that no matter what settings are used for time span or subject, the two maps don't look similar. That is to say, it doesn't appear that states that increased spending more have increased their test scores.

# Queries obtaining data to generate map viz of NAEP and spending % change.

#####################
### NAEP % CHANGE ###
#####################
cursor.execute('''
    SELECT
        end_year,
        jurisdiction,
        math_read as subject,
        mean as avg_naep
    FROM
        naep
    WHERE
        accommodations = 'R3'
        AND grade = 8
        AND jurisdiction NOT LIKE 'X%'
        AND jurisdiction NOT IN ('NT', 'NP', 'NR', 'AS', 'GU', 'VI', 'YA',
                                 'PR', 'DS', 'NL', 'DC')
    ;
''')

naep_map = pd.DataFrame(cursor.fetchall(),
                    columns=[column[0] for column in cursor.description])

naep_map_pivot = naep_map.pivot(index=['jurisdiction', 'subject'],
                            columns='end_year',
                            values='avg_naep')

naep_map_change = (
    naep_map_pivot[2024]
    .div(naep_map_pivot[2003])
    .subtract(1)
    .rename('Percent Change')
)

naep_map_change.to_csv('naep_change.csv')

#########################
### Spending % CHANGE ###
#########################

cursor.execute('''
    SELECT
        fiscal.end_year,
        fiscal.te11 / membership.student_count as exp_per_stu,
        fiscal.stabr
    FROM
        fiscal
    INNER JOIN
        membership
        ON fiscal.end_year = membership.end_year
        AND fiscal.fipst = membership.fipst
    WHERE 
        membership.total_indicator = 'Derived - Education Unit Total minus Adult Education Count'
        AND fiscal.stabr NOT IN ('GU', 'VI','AS', 'PR', 'DC', 'MP')
    ;
''')

fiscal_map = pd.DataFrame(cursor.fetchall(),
                          columns=[column[0] for column in cursor.description])

# Adjust spending for inflation
fiscal_map['exp_ia'] = fiscal_map['exp_per_stu'] * fiscal_map['end_year'].map(cpi_data)
fiscal_map = fiscal_map.drop(columns=['exp_per_stu'])

fiscal_map_pivot = fiscal_map.pivot(columns=['end_year'], index=['stabr'], values='exp_ia')
fiscal_map_change = pd.DataFrame({
    'Since 1987': (fiscal_map_pivot[2022].div(fiscal_map_pivot[1987]) - 1),
    'Since 2003': (fiscal_map_pivot[2022].div(fiscal_map_pivot[2003]) - 1),
})

(
    fiscal_map_change
    .reset_index()
    .melt(id_vars='stabr',
          value_vars=['Since 1987', 'Since 2003'],
          value_name='Percent Change in Spending',
          var_name='Timespan')
    .to_csv('fiscal_change.csv', index=False)
)
%%html
<div class='tableauPlaceholder' id='viz1744061276835' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ma&#47;Mapsofedchanges&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Mapsofedchanges&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ma&#47;Mapsofedchanges&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1744061276835');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='900px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='1127px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='900px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='1127px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='1027px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>

For the curious reader, the following combo-graph can be used to inspect spending and NAEP score changes for each individual state:

# Table generated to display line graphs together and to place null values
# where appropriate to make graphs pretty.

cursor.execute('''
       SELECT
              fiscal.end_year,
              SUM(fiscal.te11) / SUM(membership.student_count) as avg_exp_per_stu,
              fiscal.stabr as jurisdiction
       FROM
              fiscal
       INNER JOIN
              membership
       ON
              fiscal.end_year = membership.end_year
       AND
              fiscal.fipst = membership.fipst
       WHERE
              membership.total_indicator = 'Derived - Education Unit Total minus Adult Education Count'
       GROUP BY
              fiscal.end_year, fiscal.fipst
       ;
''')

fiscal = pd.DataFrame(cursor.fetchall(),
                      columns=[column[0] for column in cursor.description])

# Calculate national expenditures per student (coded as NT)
cursor.execute('''
       SELECT
              fiscal.end_year,
              SUM(fiscal.te11) / SUM(membership.student_count) as avg_exp_per_stu,
              'NT' as jurisdiction
       FROM
              fiscal
       INNER JOIN
              membership
       ON
              fiscal.end_year = membership.end_year
       AND
              fiscal.fipst = membership.fipst
       WHERE
              membership.total_indicator = 'Derived - Education Unit Total minus Adult Education Count'
       GROUP BY
              fiscal.end_year
       ;
''')

fiscal_nt = pd.DataFrame(cursor.fetchall(),
                         columns=[column[0] for column in cursor.description])

fiscal = pd.concat([fiscal, fiscal_nt])

# Adjust for inflation
fiscal['exp_per_stu_ia'] = fiscal['avg_exp_per_stu'] * fiscal['end_year'].map(cpi_data)
fiscal = fiscal.drop(columns=['avg_exp_per_stu'])

# Obtain table of NAEP Scores.
cursor.execute('''
    SELECT
        end_year,
        jurisdiction,
        math_read as subject,
        accommodations,
        mean as avg_naep
    FROM
        naep
    WHERE
        grade = 8
        AND jurisdiction NOT LIKE 'X%'
        AND jurisdiction NOT IN ('NP', 'NR', 'AS', 'GU', 'VI', 'YA',
                                 'PR', 'DS', 'NL', 'DC')
    ;
''')

naep = pd.DataFrame(cursor.fetchall(),
                    columns=[column[0] for column in cursor.description])

filters = [range(1987, 2025),
           naep['jurisdiction'].unique(),
           ['R2', 'R3'],
           ['MAT', 'RED']]

filter_product = pd.DataFrame(list(product(*filters)),
                    columns=['end_year', 'jurisdiction', 'accommodations', 'subject'])

# filter_product
naep_exp_with_null = (
   filter_product
   .merge(fiscal,
          on=['end_year', 'jurisdiction'],
          how='left')
   .merge(naep,
          on=['end_year', 'jurisdiction', 'subject', 'accommodations'],
          how='left')
)

naep_exp_with_null['jurisdiction'] = (
       naep_exp_with_null['jurisdiction']
       .replace({'NT': 'National Average'})
)

naep_exp_with_null.to_csv('naep_exp_with_null.csv', index=False)
%%html
<div class='tableauPlaceholder' id='viz1744057974228' style='position: relative'><noscript><a href='#'><img alt='Dashboard 2 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Sp&#47;SpendingandNAEPperstatedashboard&#47;Dashboard2&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='SpendingandNAEPperstatedashboard&#47;Dashboard2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Sp&#47;SpendingandNAEPperstatedashboard&#47;Dashboard2&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1744057974228');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='800px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='1027px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='800px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='1027px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='827px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>

5. Models

The visualizations above offer some clues, suggesting that there may not be a terribly strong relationship between per-student spending and NAEP scores. To further inspect the data, let's fit some linear models to see if there's any evidence of a relationship that can't be seen in visualizations.

# Query to generate table used for models.

cursor.execute('''
    WITH fiscal_cte as (
        SELECT
            fiscal.end_year,
            fiscal.te11 as total_exp,
            membership.student_count as total_mem,
            SUM(fiscal.te11) / SUM(membership.student_count) as exp_per_stu,
            fiscal.fipst,
            fiscal.stabr
        FROM fiscal
        INNER JOIN membership
        ON fiscal.end_year = membership.end_year
        AND fiscal.fipst = membership.fipst
        WHERE membership.total_indicator = 'Derived - Education Unit Total minus Adult Education Count'
        GROUP BY fiscal.end_year, fiscal.fipst
    ),
    naep_cte as (
        SELECT
            end_year,
            math_read,
            mean as avg_naep,
            accommodations,
            jurisdiction
        FROM
            naep
        WHERE
            grade = 8
    )
    SELECT
        COALESCE(naep_cte.end_year, fiscal_cte.end_year) as end_year,
        COALESCE(naep_cte.jurisdiction, fiscal_cte.stabr) as state,
        naep_cte.math_read,
        naep_cte.avg_naep,
        naep_cte.accommodations,
        fiscal_cte.exp_per_stu
        --fiscal_cte.total_exp,
        --fiscal_cte.total_mem
    FROM fiscal_cte
    FULL JOIN naep_cte
    ON fiscal_cte.end_year = naep_cte.end_year
    AND fiscal_cte.stabr = naep_cte.jurisdiction
    WHERE
        naep_cte.jurisdiction NOT LIKE 'X%' 
        AND naep_cte.jurisdiction NOT IN ('NT', 'NP', 'NR', 'AS', 'GU', 'VI', 'YA',
                                 'PR', 'DS', 'NL', 'DC')
    ;
''')

df = pd.DataFrame(cursor.fetchall(),
                              columns=[column[0] for column in cursor.description])

# Adjust for inflation
df['exp_per_stu_ia'] = (
    df['exp_per_stu'] * df['end_year'].map(cpi_data)
)

df = df.drop(columns=['exp_per_stu'])

# Create z-scores for future plotting/analysis
df['z_exp_per_stu_ia'] = df.groupby('state')['avg_naep'].transform(lambda x: (x - x.mean()) / x.std())

df.to_csv('naep_fiscal_total.csv', index=False)

First, let's use a simple linear model, notated as: $$ NAEPscore_{i,j} \sim spending_{i,j} + accom_{i,j} + subject_{i,j} $$ where $i = state$ and $j=year$.

# VERY SIMPLE OLS MODEL

# Reduce scale of expenditure per student.
df['exp_per_stu_ia_scaled'] = df['exp_per_stu_ia'] / df['exp_per_stu_ia'].mean()

model = smf.ols('avg_naep ~ exp_per_stu_ia_scaled + math_read + accommodations', data=df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               avg_naep   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     597.8
Date:                Sat, 12 Apr 2025   Prob (F-statistic):          2.10e-244
Time:                        09:30:32   Log-Likelihood:                -4403.2
No. Observations:                1309   AIC:                             8814.
Df Residuals:                    1305   BIC:                             8835.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept               262.9744      0.789    333.209      0.000     261.426     264.523
math_read[T.RED]        -15.8036      0.399    -39.595      0.000     -16.587     -15.021
accommodations[T.R3]      6.5232      0.595     10.970      0.000       5.357       7.690
exp_per_stu_ia_scaled     9.7450      0.766     12.719      0.000       8.242      11.248
==============================================================================
Omnibus:                       23.987   Durbin-Watson:                   1.259
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.914
Skew:                          -0.286   Prob(JB):                     1.06e-05
Kurtosis:                       2.697   Cond. No.                         9.33
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This shows that there is a positive relationship between spending and NAEP scores: the estimated coefficient of the spending per student term, exp_per_stu_ia_scaled1, is positive with p-value $<.0001$. And this tracks with the following scatter plots. You can see that at every combination of subject and accommodation there's a positive relationship between per-student spending and NAEP scores.

%%html
<div class='tableauPlaceholder' id='viz1744151814181' style='position: relative'><noscript><a href='#'><img alt='NAEP vs Spending- across subject and accommodations ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;NA&#47;NAEPvsSpendingpanel&#47;NAEPvsSpending&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='NAEPvsSpendingpanel&#47;NAEPvsSpending' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;NA&#47;NAEPvsSpendingpanel&#47;NAEPvsSpending&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1744151814181');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>

But things may not be as they seem! This does not account for the differences between states. Let's standardize NAEP scores for each state then plot those against spending per student.

%%html
<div class='tableauPlaceholder' id='viz1744151649235' style='position: relative'><noscript><a href='#'><img alt='Standardized NAEP vs Spending- across subject and accommodations ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;St&#47;StandardizedNAEPvsSpendingpanel&#47;StandardizedNAEPvsSpending&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='StandardizedNAEPvsSpendingpanel&#47;StandardizedNAEPvsSpending' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;St&#47;StandardizedNAEPvsSpendingpanel&#47;StandardizedNAEPvsSpending&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1744151649235');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>

Now the positive relationship seems to have greatly diminished, if not all but disappeared. That is, after accounting for differences between states, there is no longer an obvious positive association between per-student spending and NAEP scores. Let's add a term for the effect of state (a fixed effect for state) to the model above:

$$ NAEPscore_{i,j} \sim spending_{i,j} + accom_{i,j} + subject_{i,j} + F_{i} $$

where $i = state$, $j=year$, and $F_{i}$ is a fixed effect for state.

# MODEL WITH STATE AS FIXED EFFECT
# One could reason that the fixed effect model should be used for no other reason than
# the states don't come from a distribution; they are the entire population!
from linearmodels.panel import PanelOLS

# Convert to panel data format
df = df.set_index(["state", "end_year"])

# Fixed effect model
model_fe = PanelOLS.from_formula("avg_naep ~ exp_per_stu_ia_scaled + math_read + accommodations + EntityEffects", df.dropna()).fit()
print(model_fe.summary)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:               avg_naep   R-squared:                        0.7509
Estimator:                   PanelOLS   R-squared (Between):              0.1655
No. Observations:                1309   R-squared (Within):               0.7509
Date:                Sat, Apr 12 2025   R-squared (Overall):              0.5639
Time:                        09:31:24   Log-likelihood                   -3808.0
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      1262.3
Entities:                          50   P-value                           0.0000
Avg Obs:                       26.180   Distribution:                  F(3,1256)
Min Obs:                       20.000                                           
Max Obs:                       28.000   F-statistic (robust):          3.007e+04
                                        P-value                           0.0000
Time periods:                      16   Distribution:                  F(3,1256)
Avg Obs:                       81.812                                           
Min Obs:                       37.000                                           
Max Obs:                      100.000                                           
                                                                                
                                   Parameter Estimates                                   
=========================================================================================
                       Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------------
exp_per_stu_ia_scaled     4.6120     1.1104     4.1535     0.0000      2.4336      6.7905
math_read[MAT]            267.47     0.9412     284.19     0.0000      265.63      269.32
math_read[RED]            251.78     0.9596     262.39     0.0000      249.90      253.67
accommodations[T.R3]      7.2039     0.4552     15.824     0.0000      6.3107      8.0970
=========================================================================================

F-test for Poolability: 38.011
P-value: 0.0000
Distribution: F(49,1256)

Included effects: Entity

The term for spending per student is still statistically significant, however the parameter estimate has decreased greatly (from 9.745 to 4.6120) and the T-stat has also decreased a lot (12.719 to 4.1535). Furthermore, the R-squared has gone up a great deal (0.579 to 0.7509) indicating that a large portion of the change in NAEP scores is explained by the fixed effect for state.

Also, notice what happens when the term for per-student spending is removed: R-squared barely goes down.

# Model with no effect for spending.

model_fe_no_spending = PanelOLS.from_formula("avg_naep ~ math_read + accommodations + EntityEffects", df).fit()
print(model_fe_no_spending.summary)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:               avg_naep   R-squared:                        0.7205
Estimator:                   PanelOLS   R-squared (Between):              0.0505
No. Observations:                1409   R-squared (Within):               0.7205
Date:                Sat, Apr 12 2025   R-squared (Overall):              0.5173
Time:                        09:31:40   Log-likelihood                   -4202.5
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      1748.6
Entities:                          50   P-value                           0.0000
Avg Obs:                       28.180   Distribution:                  F(2,1357)
Min Obs:                       22.000                                           
Max Obs:                       30.000   F-statistic (robust):           2.92e+05
                                        P-value                           0.0000
Time periods:                      17   Distribution:                  F(2,1357)
Avg Obs:                       82.882                                           
Min Obs:                       37.000                                           
Max Obs:                      100.000                                           
                                                                                
                                  Parameter Estimates                                   
========================================================================================
                      Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
----------------------------------------------------------------------------------------
math_read[MAT]           271.12     0.3555     762.66     0.0000      270.43      271.82
math_read[RED]           255.44     0.4138     617.23     0.0000      254.63      256.25
accommodations[T.R3]     7.7738     0.3894     19.964     0.0000      7.0099      8.5377
========================================================================================

F-test for Poolability: 40.793
P-value: 0.0000
Distribution: F(49,1357)

Included effects: Entity

This would suggest that spending per student, while statistically significant in explaining changes in NAEP scores, is not practically significant.

Moreover, if we fit a model for only the reading NAEP scores, we can say that spending doesn't have anything to do with reading scores (p-value = 0.4825). But this was already obvious from the above visualizations.

df_read_r3 = df.loc[(df['math_read'] == 'RED')]
fdsa = PanelOLS.from_formula("avg_naep ~ exp_per_stu_ia_scaled + accommodations + EntityEffects", df_read_r3).fit()

print(fdsa.summary)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:               avg_naep   R-squared:                        0.0050
Estimator:                   PanelOLS   R-squared (Between):             -0.0305
No. Observations:                 613   R-squared (Within):               0.0050
Date:                Sat, Apr 12 2025   R-squared (Overall):             -0.0207
Time:                        09:31:46   Log-likelihood                   -1538.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      1.4087
Entities:                          50   P-value                           0.2453
Avg Obs:                       12.260   Distribution:                   F(2,561)
Min Obs:                      10.0000                                           
Max Obs:                       13.000   F-statistic (robust):          3.096e+04
                                        P-value                           0.0000
Time periods:                      12   Distribution:                   F(2,561)
Avg Obs:                       51.083                                           
Min Obs:                       41.000                                           
Max Obs:                       72.000                                           
                                                                                
                                   Parameter Estimates                                   
=========================================================================================
                       Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------------
exp_per_stu_ia_scaled    -0.7820     1.1129    -0.7027     0.4825     -2.9679      1.4039
accommodations[R2]        263.11     1.0654     246.95     0.0000      261.02      265.20
accommodations[R3]        264.09     1.1533     228.99     0.0000      261.82      266.35
=========================================================================================

F-test for Poolability: 28.322
P-value: 0.0000
Distribution: F(49,561)

Included effects: Entity

For the curious, here is the results for the model fit over only the math data:

df_math_r3 = df.loc[(df['math_read'] == 'MAT')]
rewq = PanelOLS.from_formula("avg_naep ~ exp_per_stu_ia_scaled + accommodations + EntityEffects", df_math_r3).fit()

print(rewq.summary)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:               avg_naep   R-squared:                        0.4736
Estimator:                   PanelOLS   R-squared (Between):              0.1861
No. Observations:                 696   R-squared (Within):               0.4736
Date:                Sat, Apr 12 2025   R-squared (Overall):              0.3130
Time:                        09:32:00   Log-likelihood                   -2054.2
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      289.72
Entities:                          50   P-value                           0.0000
Avg Obs:                       13.920   Distribution:                   F(2,644)
Min Obs:                      10.0000                                           
Max Obs:                       15.000   F-statistic (robust):          2.335e+04
                                        P-value                           0.0000
Time periods:                      14   Distribution:                   F(2,644)
Avg Obs:                       49.714                                           
Min Obs:                       37.000                                           
Max Obs:                       78.000                                           
                                                                                
                                   Parameter Estimates                                   
=========================================================================================
                       Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------------
exp_per_stu_ia_scaled     8.2023     1.6415     4.9968     0.0000      4.9789      11.426
accommodations[R2]        263.15     1.3447     195.69     0.0000      260.50      265.79
accommodations[R3]        271.41     1.7230     157.52     0.0000      268.02      274.79
=========================================================================================

F-test for Poolability: 24.583
P-value: 0.0000
Distribution: F(49,644)

Included effects: Entity

6. Conclusion

It was pretty clearly from the visualizations that there wasn't an obvious relationship between per-student spending and NAEP scores. And that was corroborated with the models that were fit. It appears that spending per student isn't greatly correlated with math success and not at all with reading, at least measured by NAEP scores. Personally, I find the history and current trajectory of NAEP reading scores the most alarming finding.

Notes

1 Perhaps you're wondering why the notation for the spending term has the suffix _scaled. Originially, the ols function from statsmodels complained that "The condition number is large, 7.07e+04. This might indicate that there are strong multicollinearity or other numerical problems." I scaled the inflation adjusted per-student spending data by dividing by the mean to shut it up.